Interactive Analysis of the Iris Dataset

Authors

Sherlock Holmes, Department of Deductive Analytics, Baker Street University
Alice Liddell, Wonderland Institute of Pattern Recognition

Published

January 5, 2025

Abstract

The Iris dataset, introduced by Ronald Fisher in 1936, contains measurements of 150 flowers across three species: setosa, versicolor, and virginica. This paper presents an interactive exploratory analysis demonstrating how petal and sepal measurements distinguish these species. Our analysis shows that while setosa is easily separable, distinguishing versicolor from virginica requires more sophisticated approaches.

Example scroll made with Quarto for Scroll Press.

Analysis

Code
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
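As a quick sanity check on the loading step above, the class balance stated in the abstract (150 flowers, three species) can be confirmed with a short standalone sketch; the 50-per-species split is a known property of Fisher's dataset:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species_name'] = pd.Series(iris.target).map(
    {0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Fisher's dataset is balanced: 50 samples per species, 150 total
counts = df['species_name'].value_counts()
print(counts.to_dict())
print(len(df))  # 150
```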

Feature Relationships

Interactive exploration reveals strong correlations between petal measurements and clear separation of setosa from other species.

Code
fig = px.scatter_matrix(
    df,
    dimensions=iris.feature_names,
    color='species_name',
    labels={col: col.replace(' (cm)', '') for col in iris.feature_names},
    color_discrete_map={'setosa': '#636EFA', 'versicolor': '#EF553B', 'virginica': '#00CC96'},
    height=600, width=800
)
fig.update_traces(diagonal_visible=False)
fig.show()
Figure 1: Pairwise feature relationships across species.
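The "strong correlations between petal measurements" visible in Figure 1 can be quantified directly. A minimal sketch, reloading the data so it runs standalone:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between the two petal measurements
r = df['petal length (cm)'].corr(df['petal width (cm)'])
print(f'petal length vs petal width: r = {r:.3f}')  # roughly 0.96
```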

Dimensionality Reduction

PCA reveals that the first two principal components explain 96% of variance, enabling effective 2D visualization.

Code
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['species'] = df['species_name']

fig = px.scatter(
    pca_df, x='PC1', y='PC2', color='species',
    title=f'PCA (Explained Variance: {sum(pca.explained_variance_ratio_):.1%})',
    labels={'PC1': f'PC1 ({pca.explained_variance_ratio_[0]:.1%})',
            'PC2': f'PC2 ({pca.explained_variance_ratio_[1]:.1%})'},
    color_discrete_map={'setosa': '#636EFA', 'versicolor': '#EF553B', 'virginica': '#00CC96'},
    height=500, width=700
)
fig.update_traces(marker=dict(size=10))
fig.show()
Figure 2: PCA projection capturing 96% of variance.
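The 96% figure can be reproduced in isolation; a short sketch assuming the same standardize-then-PCA pipeline as the code above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features, then project onto the first two components
X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# PC1 + PC2 together capture roughly 96% of the standardized variance
total = pca.explained_variance_ratio_.sum()
print(f'explained variance: {total:.1%}')  # about 95.8%, reported as 96% above
```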

Classification

Logistic regression achieves 97% accuracy on test data, with errors concentrated in the versicolor-virginica boundary region.

Code
# Train classifier
X_train, X_test, y_train, y_test = train_test_split(X_pca, iris.target, test_size=0.2, random_state=42)
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)

# Create mesh
h = 0.02
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot
import plotly.graph_objects as go
fig = go.Figure()

fig.add_trace(go.Contour(
    x=xx[0], y=yy[:, 0], z=Z,
    colorscale=[[0, '#636EFA'], [0.5, '#EF553B'], [1, '#00CC96']],
    opacity=0.3, showscale=False, hoverinfo='skip'
))

for idx, name in enumerate(['setosa', 'versicolor', 'virginica']):
    mask = iris.target == idx
    fig.add_trace(go.Scatter(
        x=X_pca[mask, 0], y=X_pca[mask, 1],
        mode='markers', name=name,
        marker=dict(size=8, color=['#636EFA', '#EF553B', '#00CC96'][idx])
    ))

fig.update_layout(title='Decision Boundaries', xaxis_title='PC1', yaxis_title='PC2', height=500, width=700)
fig.show()
Figure 3: Decision boundaries in PCA space.
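The 97% accuracy figure depends on the particular split; a self-contained sketch of the same pipeline (scale, PCA, logistic regression with `test_size=0.2` and `random_state=42`, as in the code above) lets the held-out score be checked:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same preprocessing as above: standardize, then project to 2 PCs
X_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data))

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=200).fit(X_train, y_train)
score = lr.score(X_test, y_test)
print(f'test accuracy: {score:.1%}')
```

Any misclassifications typically involve versicolor and virginica, consistent with the boundary region shown in Figure 3.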

Conclusion

The Iris dataset demonstrates fundamental machine learning concepts: setosa is linearly separable, while versicolor and virginica overlap in feature space. Interactive visualizations reveal these relationships clearly, making this dataset an enduring pedagogical example.

